Abstract

In this paper, we investigate the theoretical guarantees of penalized $\ell_1$-minimization (also
called Basis Pursuit Denoising or Lasso) in terms of sparsity pattern recovery (support and
sign consistency) from noisy measurements with not necessarily random noise, when the
sensing operator belongs to the Gaussian ensemble (i.e. random design matrix with i.i.d.
Gaussian entries). More precisely, we derive sharp non-asymptotic bounds on the sparsity
level and (minimal) signal-to-noise ratio that ensure support identification for most signals
and most Gaussian sensing matrices by solving the Lasso with an appropriately chosen
regularization parameter.

Our first purpose is to establish conditions allowing exact sparsity pattern recovery
when the signal is strictly sparse. Then, these conditions are extended to cover the
compressible or nearly sparse case. In these two results, the role of the minimal signal-to-noise
ratio is crucial. Our third main result gets rid of this assumption in the strictly sparse
case, but this time, the Lasso allows only partial recovery of the support. We also provide
in this case a sharp $\ell_2$-consistency result on the coefficient vector.

The results of the present work have several distinctive features compared to previous
ones. One of them is that the leading constants involved in all the bounds are sharp
and explicit. This is illustrated by numerical experiments showing that the sharp sparsity
level threshold identified by our theoretical results, below which sparsistency of the Lasso
solution is guaranteed, matches the one observed empirically.

Key words: Compressed sensing, $\ell_1$ minimization, sparsistency, consistency.



1. Introduction 

1.1. Problem setup 

The conventional wisdom in digital signal processing is the Shannon sampling theorem 
valid for bandlimited signals. However, such a sampling scheme excludes many signals of 
interest that are not necessarily bandlimited but can still be explained either exactly or 
accurately by a small number of degrees of freedom. Such signals are termed sparse signals. 
In fact we distinguish two types of sparsity: strict and weak sparsity (the latter is also
termed compressibility). A signal $x$, considered as a vector in a finite dimensional subspace
of $\mathbb{R}^p$, is strictly or exactly sparse if all but a few of its entries vanish; i.e., if its support
$I(x) = \mathrm{supp}(x) = \{1 \le i \le p \mid x[i] \neq 0\}$ is of cardinality $k \ll p$. A $k$-sparse signal is
a signal where exactly $k$ samples have a non-zero value. Signals and images of practical
interest may be compressible or weakly sparse in the sense that the sorted magnitudes
$|x_{\mathrm{sorted}}[i]|$ decay quickly. Thus $x$ can be well-approximated as $k$-sparse up to an error term
(this property will be used when we tackle compressible signals). If a signal is not
sparse in its original domain, it may be sparsified in an appropriate orthobasis $\Phi$ (hence
the importance of the point of view of computational harmonic analysis and approximation
theory). Without loss of generality, we assume throughout that $\Phi$ is the standard basis.
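
As a small illustration of these definitions, the following Python sketch computes the support of a vector and its sorted magnitudes. The function names and the tolerance argument `tol` are illustrative choices rather than notation from the paper, and indices are 0-based here instead of 1-based.

```python
def support(x, tol=0.0):
    """Return the (0-based) indices i with |x[i]| > tol, i.e. the support I(x)."""
    return [i for i, xi in enumerate(x) if abs(xi) > tol]


def sorted_magnitudes(x):
    """Return the magnitudes |x[i]| sorted in decreasing order."""
    return sorted((abs(xi) for xi in x), reverse=True)


# A strictly sparse vector in R^6 with k = 2 non-zero entries.
x = [0.0, 3.0, 0.0, -1.5, 0.0, 0.0]
I = support(x)   # indices of the non-zero entries
k = len(I)       # sparsity level, equal to the l0 pseudo-norm of x
```

For a compressible signal, `sorted_magnitudes` would show a fast decay instead of exact zeros.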

The compressed sensing/sampling theory (l|,[2|,0| asserts that sparse or compressible signals can
be reconstructed with theoretical guarantees from far fewer measurements than the ambient
dimension of the signal. Furthermore, the reconstruction is stable if the measurements are
corrupted by an additive bounded noise. The encoding (or sampling) step is very fast since
it gathers $n$ non-adaptive linear measurements that preserve the structure of the signal $x_0$:

$y = A x_0 + w \in \mathbb{R}^n$, (1)

where $A \in \mathbb{R}^{n \times p}$ is a rectangular measurement matrix, i.e., $n < p$, and $w$ accounts for
possible noise with bounded $\ell_2$ norm. In this work, we do not need $w$ to be random and
we consider that $A$ is drawn from the Gaussian matrix ensemble^1, i.e., the entries of $A$ are
independent and identically distributed (i.i.d.) $\mathcal{N}(0, 1/n)$. The columns of $A$ are denoted
$a_i$, for $i = 1, \dots, p$. In the sequel, the sub-matrix $A_I$ is the restriction of $A$ to the columns
indexed by $I(x)$. To lighten the notation, the dependence of $I$ on $x$ is dropped and should
be understood from the context.
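
The measurement model (1) can be simulated as follows. This is a minimal sketch assuming NumPy; the helper name `gaussian_measurements` and the way the noise is generated (a random direction rescaled to norm `eps`) are illustrative assumptions, since the text only requires the noise to have bounded $\ell_2$ norm.

```python
import numpy as np


def gaussian_measurements(x0, n, eps, rng):
    """Draw A with i.i.d. N(0, 1/n) entries and return (A, w, y) with y = A x0 + w."""
    p = x0.shape[0]
    # Standard deviation 1/sqrt(n) gives variance 1/n, so columns have unit expected squared norm.
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
    # Illustrative noise: any vector with ||w||_2 <= eps would be admissible.
    w = rng.normal(size=n)
    w *= eps / np.linalg.norm(w)
    y = A @ x0 + w
    return A, w, y


rng = np.random.default_rng(0)
x0 = np.zeros(50)
x0[[3, 17, 41]] = [1.5, -2.0, 0.7]   # a 3-sparse signal in R^50
A, w, y = gaussian_measurements(x0, n=20, eps=0.1, rng=rng)
```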

The signal is reconstructed from this underdetermined system of linear equations by
solving a convex program of the form:

$x^\star \in \operatorname{argmin}_{x \in \mathbb{R}^p} \|x\|_1$ such that $Ax - y \in C$, (2)

where $C$ is an appropriate closed convex set, and $\|x\|_q := (\sum_i |x[i]|^q)^{1/q}$, $q \ge 1$, is the $\ell_q$-
norm of a vector with the usual adaptation for $q = \infty$: $\|x\|_\infty = \max_i |x[i]|$. We also denote
$\|x\|_0$ as the $\ell_0$ pseudo-norm which counts the number of non-zero entries of $x$. Obviously,
$\|x\|_0 = |I(x)|$. For any vector $x$, the notation $\bar{x} \in \mathbb{R}^{|I(x)|}$ means the restriction of $x$ to its
support.

Typically, if $C = \{0\}$ (no noise), we end up with the so-called Basis Pursuit [3] problem

$\min_{x \in \mathbb{R}^p} \|x\|_1$ such that $y = Ax$. (BP)

Taking $C$ as the $\ell_2$ ball of radius $\epsilon$, we have a noise-aware variant of BP

$\min_{x \in \mathbb{R}^p} \|x\|_1$ such that $\|Ax - y\|_2 \le \epsilon$, ($\ell_1$-constrained)

where the parameter $\epsilon > 0$ depends on the noise level $\|w\|_2$. This constrained form can
also be shown to be equivalent to the $\ell_1$-penalized optimization problem, which goes by
the name of Basis Pursuit DeNoising [3] or Lasso in the statistics community after [5]:

$\min_{x \in \mathbb{R}^p} \frac{1}{2} \|y - Ax\|_2^2 + \gamma \|x\|_1$, (Lasso)

where $\gamma$ is the regularization parameter. ($\ell_1$-constrained) and (Lasso) are equivalent in the
sense that there is a bijection between $\gamma$ and $\epsilon$ such that both problems share the same set
of solutions. However, this bijection is not known explicitly and depends on $y$ and $A$, so that
in practice, one needs to use different algorithms to solve each problem, and theoretical
results are stated using one formulation or the other. In this paper, we focus on the Lasso
formulation. It is worth noting that the Dantzig selector 01 is also a special instance of
(2) when $C = \{z \in \mathbb{R}^n \mid \|A^T z\|_\infty \le \gamma\}$.

^1 In a statistical linear regression setting, we would speak of a random Gaussian design.

The convex problems of the form ($\ell_1$-constrained) and (Lasso) are computationally
tractable and many algorithms have been developed to solve them; we only mention
here a few representatives. Homotopy continuation algorithms 0, 10| track the whole
regularization path. Many first-order algorithms originating from convex non-smooth
optimization theory have been proposed to solve (Lasso). These include one-step iterative
thresholding algorithms [11, 12, 13, 14], or accelerated variants [15, 16], multi-step schemes
such as [17] or [18]. The Douglas-Rachford algorithm (lol. [2^ is a first-order scheme that
can be used to solve ($\ell_1$-constrained). A more comprehensive account can be found in
Chapter 7].
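
To make the one-step iterative thresholding idea concrete, here is a minimal sketch of a basic proximal-gradient (iterative soft-thresholding) loop for (Lasso), assuming NumPy. It is a generic textbook scheme, not a reproduction of any of the cited algorithms; the function names, the fixed iteration count and the step size $1/\|A\|^2$ are illustrative assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(A, y, gamma, n_iter=500):
    """Approximately minimize 0.5 * ||y - A x||_2^2 + gamma * ||x||_1."""
    # Lipschitz constant of the gradient of the quadratic term (squared spectral norm).
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                       # gradient of the data-fidelity term
        x = soft_threshold(x - grad / L, gamma / L)    # gradient step, then thresholding
    return x
```

When $A$ is the identity, the minimizer is simply the soft-thresholding of $y$ at level $\gamma$, which gives an easy sanity check of the loop.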
1.2. Theoretical performance measures of the Lasso

In recent years, we have witnessed a flurry of research activity where efforts have been
made to investigate the theoretical guarantees of $\ell_1$ minimization by solving the Lasso for
sparse recovery from noisy measurements in the underdetermined case $n < p$. Overall, the
derived conditions hinge on strong assumptions on the structure and interaction between
the variables in $A$ as indexed by $x_0$. An overview of the literature pertaining to our work
will be covered in Section 1.3 after notions are introduced so that the discussions are clearer.

Let $x_0$ be the original vector as defined in (1), $f_0 = A x_0$ the noiseless measurements,
$x(\gamma)$ a minimizer of the Lasso problem and $f(\gamma) = A x(\gamma)$.

Consistency. $\ell_q$-consistency on the signal $x$ means that the $\ell_q$-error $\|x_0 - x(\gamma)\|_q$, for
typically $q = 1$, $2$ or $\infty$, between the unknown vector $x_0$ and a solution $x(\gamma)$ of either (Lasso)
or ($\ell_1$-constrained) comes within a factor of the noise level.



Sparsistency. Sparsity pattern recovery (also dubbed sparsistency for short or variable
selection in the statistical language) requires that the indices and signs of the solutions
$x(\gamma)$ are equal to those of $x_0$ for a well chosen value of $\gamma$. Partial support recovery occurs
when the recovered support is included (strictly) in that of $x_0$ with the correct sign pattern.



In general, it is not clear which of these performance measures is better to characterize
the Lasso solution. Nevertheless, in the noisy case, consistency does not tell the whole
story and there are many applications where bounds on the $\ell_q$-error are insufficient to
characterize the accuracy of the Lasso estimate. In this case, exact or partial recovery
of the support, hence of the correct model variables, is the desirable property to have.
Among other advantages, this makes it possible, for instance, to circumvent the bias of the
Lasso and thus enhance the estimation of $x_0$ and $A x_0$ using a debiasing procedure: recover
the support $I$ by solving the Lasso, followed by least-squares regression on the selected
variables $(a_i)_{i \in I}$; see e.g. [6, 23]. Our work falls within this scope and focuses on exact and
partial support identification for both strictly sparse and compressible signals in the presence
of noise on Gaussian random measurements.
1.3. Literature overview

The properties of the Lasso have been extensively studied, including consistency and 
distribution of its estimates. There is of course a huge literature on the subject, and 
covering it fairly is beyond the scope of this paper. In this section, we restrict our overview 
to those works pertaining to ours, i.e., sparsity pattern recovery in presence of noise. 

Much recent work aims at understanding the Lasso estimates from the point of view of
sparsistency. This body of work includes [22, la, 23, 24, 25, 26, 27, 28, 29]. For the Lasso
estimates to be close to the model selection estimates when the data dimensions $(n, p)$
grow, all the aforementioned papers assumed a sparse model and used various conditions
that require the irrelevant variables to be not too correlated with the relevant ones.

Mutual coherence-based conditions. Several researchers have studied independently the
qualitative performance of the Lasso for either exact or partial sparsity pattern recovery of
sufficiently sparse signals under a mutual coherence condition on the measurement matrix
$A$; see for instance 23, S^, 2^, 31| when $A$ is deterministic, and [32] when $A$ is Gaussian.
However, mutual coherence is known to lead to overly pessimistic sparsity bounds.

Support structure-based conditions. These sufficient recovery conditions were refined by
considering not only the cardinality of the support but also its structure, including the
signs of the non-zero elements of $x_0$. Such criteria use the interactions between the relevant
columns $A_I = (a_i)_{i \in I}$ and the irrelevant ones $(a_j)_{j \notin I}$. More precisely, we define the
following condition developed in [33] to analyze the properties of the Lasso. This condition
goes by the name of irrepresentable condition in the statistical literature; see e.g. [28, 22,
27, 34] and [35] for a detailed review.



Definition 1. Let $I$ be the support of $x_0$ and $I^c$ its complement in $\{1, \dots, p\}$. The
irrepresentable (or Fuchs) condition is fulfilled if

$F(x_0) := \|A_{I^c}^T A_I (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0)\|_\infty = \max_{j \in I^c} |\langle a_j, d(x_0) \rangle| < 1$, (3)

where $d(x_0) := A_I (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0)$. (4)

Condition (3) will also be the soul of our analysis in this paper.
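
The quantity in Definition 1 is easy to evaluate numerically. The sketch below, assuming NumPy, computes $d(x_0)$ and the Fuchs criterion $F(x_0)$ for a given matrix and vector; the function name `fuchs_criterion` is an illustrative choice, and the code assumes that the columns of $A$ indexed by the support are linearly independent.

```python
import numpy as np


def fuchs_criterion(A, x0):
    """Return (F, d): F = max_{j not in I} |<a_j, d>| with d = A_I (A_I^T A_I)^{-1} sign(x0 on I)."""
    I = np.flatnonzero(x0)                          # support of x0
    Ic = np.setdiff1d(np.arange(A.shape[1]), I)     # complement of the support
    A_I = A[:, I]
    s = np.sign(x0[I])                              # sign pattern on the support
    d = A_I @ np.linalg.solve(A_I.T @ A_I, s)       # vector d(x0)
    F = float(np.max(np.abs(A[:, Ic].T @ d))) if Ic.size else 0.0
    return F, d


# Small deterministic example: one relevant column, one irrelevant column.
A_demo = np.array([[1.0, 0.5],
                   [0.0, 0.5]])
F_demo, d_demo = fuchs_criterion(A_demo, np.array([2.0, 0.0]))
```

The irrepresentable condition holds for this example whenever `F_demo < 1`.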

The criterion (3) is closely related to the exact recovery coefficient (ERC) of Tropp [i^:

$\mathrm{ERC}(x_0) := 1 - \max_{j \notin I} \|(A_I^T A_I)^{-1} A_I^T a_j\|_1$. (5)

In [26, Corollary 13], it is established that if $\mathrm{ERC}(x_0) > 0$, then the support of the Lasso
solution with a large enough parameter $\gamma$ is included in the one of the subset selection (i.e.,
$\ell_0$-minimization) optimal solution.



In [28], an asymptotic result is reported showing that (3)^2 is sufficient for the Lasso
to guarantee exact support recovery and sign consistency. It is also shown that (3) is
essentially necessary for variable selection. [2^ develop very similar results and use similar
requirements. [36] and [37] derive asymptotic conditions for sparsistency of the block Lasso
[38] by extending (3) and (5) to the group setting.



Reference [22] proposes a non-asymptotic analysis with a sufficient condition ensuring
exact support and sign pattern recovery of most sufficiently sparse vectors for matrices
satisfying a weak coherence condition (of the order $(\log p)^{-1}$). Their proof relies upon
(3) and a bound on norms of random sub-matrices developed in (s^. The work in [27]
considers a condition of the form (3) to ensure sparsity pattern recovery. The analysis
in that paper was conducted for both deterministic and standard Gaussian $A$ in a
high-dimensional setting where $p$ and the sparsity level grow with the number of measurements
$n$. That author also established that violation of (3) is sufficient for failure of the Lasso in
recovering the support set. In ji^l, the sufficient bound on the number of measurements
established in [27] for the standard Gaussian dense ensemble was shown to hold for sparse
measurement ensembles. The works of [22] and [27] are certainly the most closely related
to ours. We will elaborate more on these connections by highlighting the similarities and
differences in Section 2.4.

^2 In fact, a slightly stronger assumption requiring that all elements in (3) are uniformly bounded away
from 1.



Variations on the Lasso. Other variations of the Lasso have also been studied, such as the
adaptive Lasso^3 or multi-stage variable selection methods (iil . 44, 45, |4^, 34 1. For an
overview of other penalized methods that have been proposed for the purpose of variable
selection, see [4

Information-theoretic bounds. A recent line of research has developed information-theoretic
sufficient and necessary bounds to characterize fundamental limits on minimal signal-to-
noise ratio (SNR), the number of measurements $n$, and tolerable sparsity level $k$ required
for exact or partial support pattern recovery of exactly sparse signals by any algorithm
including the optimal exhaustive $\ell_0$ decoder 47, 4^, 49, 50, 51, 13 y, 0,1551,0,153. In
most of these works, the bounds are asymptotic, i.e., they provide asymptotic scaling and
typically require that the sparsity level $k$ varies at some rate (linearly or sub-linearly) with
the signal dimension $p$ when $n$ grows to infinity. It is worth mentioning that a careful
normalization is needed, for instance of the sampling matrix and noise, when comparing
these results in the literature.

The paper [47] was the first to consider the information-theoretic limits of exact
sparsity recovery from the Gaussian measurement ensemble, explicitly identifying the minimal
SNR (or equivalently $T = \min_{i \in I(x_0)} |x_0[i]|$) as a key parameter. This analysis yielded
necessary and sufficient conditions on the tuples $(n, p, k, T)$ for asymptotically reliable sparsity
recovery. This complements the analysis of [27] by showing that in the sub-linear sparsity
regime, i.e. $k = o(p)$, the number of measurements required by the Lasso^4, $n \gtrsim k \log(p - k)$,
achieves the information-theoretic necessary bound.

Subsequent work of 4^, 49, 0, El, [13,0,0, [Hi, S, l57fl has extended or strengthened 
this type of analysis to other settings (e.g. partial support recovery, other matrix ensembles, 
other scaling regimes, compressible case). 



1.4. Contributions

Most of the results developed in the literature on sparsistency of the Lasso estimate
exhibit asymptotic scaling results in terms of the triple $(n, p, k)$, but this does not tell the
whole story. One often needs to know explicitly the exact numerical constants involved in
the bounds, not only their dependence on key quantities such as the SNR and/or other
parameters of the signal $x_0$. As a consequence, the majority of sufficient conditions are
more conservative than those suggested by empirical evidence.



^3 The adaptive Lasso as seen in the statistical literature turns out to be a two-step procedure, where
the second step is to solve a reweighted $\ell_1$ norm problem, with weights given by the Lasso estimate in the
first step. In fact, this is a special case of the iteratively reweighted $\ell_1$-minimization [41].

^4 The shorthand notation $f \gtrsim g$ means that $g = O(f)$.






In this paper, we investigate the theoretical properties of the Lasso estimate in terms
of sparsity pattern recovery (support and sign consistency) from noisy measurements (the
noise being not necessarily random) when the measurement matrix belongs to the Gaussian
ensemble. We provide precise non-asymptotic bounds, including explicit sharp leading
numerical constants, on the key quantities that come into play (sparsity level for a given
measurement budget, minimal SNR, regularization parameter) to ensure exact or partial
sparsity pattern recovery for both strictly sparse and compressible signals. Our results
have several distinctive features compared to previous closely-connected works. This will
be discussed in further detail in Section 2.4. Numerical evidence is reported in Section 6
to confirm the theoretical findings.

1.5. Organization of the paper 

The rest of the paper is organized as follows. We first state our main results and discuss
the connections and novelties with respect to existing work. In Sections 3 and 4 we detail
the proofs for exact recovery with strictly sparse and compressible signals, before proving
the partial support recovery result in Section 5. Numerical experiments are carried out in
Section 6. Section 7 includes a final discussion and some concluding remarks.

2. Main results 

Our first result, Theorem 1, establishes conditions allowing exact sparsity pattern
recovery when the signal is strictly sparse. Then, these conditions are extended to cover
the compressible case in Theorem 2. In these two results, the role of the minimal SNR
is crucial. Our third main result in Theorem 3 gets rid of this assumption in the strictly
sparse case, but this time, the Lasso allows only partial recovery of the support. We also
provide in this case a sharp $\ell_2$-consistency result on the Lasso estimate.

The three theorems are stated following the same structure: suppose that $(x_0, w)$ fulfill
some requirements formalized by a set $\mathcal{Y}$, then with overwhelming probability (w.o.p. for
short) on the choice of $A$, the Lasso estimate obeys some property $\mathcal{P}$. It should be noted
that these theorems imply in particular that w.o.p. on the choice of $A$, for most vectors
$(x_0, w) \in \mathcal{Y}$, the Lasso estimate satisfies property $\mathcal{P}$, whatever the probability measure
used on the set $\mathcal{Y}$.

The proof of Theorem 1 is given in Section 3. We prove its extension to compressible
signals as stated in Theorem 2 in Section 4. Both proofs capitalize on an implicit formula
of the Lasso solution. The proof of Theorem 3 given in Section 5 is quite different, since
no such implicit formula is used directly.

2.1. Exact Support Recovery with Strictly Sparse Signals 

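Theorem 1. Let $A \in \mathbb{R}^{n \times p}$ be a Gaussian matrix, i.e. its entries are i.i.d. $\mathcal{N}(0, 1/n)$,
$w \in \mathbb{R}^n$ is such that $\|w\|_2 \le \epsilon$, $0 < \alpha, \beta < 1$ and p > 6^(1-^) . Suppose that $x_0 \in \mathbb{R}^p$ obeys

$\|x_0\|_0 = k \le \frac{\alpha \beta n}{2 \log p}$ (6)

and

$T = \min_{i \in I} |x_0[i]| \ge \frac{5.5\,\epsilon}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}}$. (7)

Solve the Lasso problem from the measurements $y = A x_0 + w$. Then with probability
$P(n, p, \alpha, \beta)$ converging to 1 as $n$ goes to infinity, the Lasso solution $x(\gamma)$ with

$\gamma = \frac{\epsilon}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}}$ (8)

is unique and satisfies

$\mathrm{supp}(x(\gamma)) = \mathrm{supp}(x_0)$ and $\mathrm{sign}(x(\gamma)) = \mathrm{sign}(x_0)$.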



The proof (see Section 3) provides an explicit bound for $P(n, p, \alpha, \beta)$, showing in
particular that $P(n, p, \alpha, \beta)$ is larger than

1 - - , ^ - o ( - o(e-o-^v^) , 

2 2^^^ VogpJ ^ ' 

although this bound on the probability is far from optimal. 

In plain words, Theorem 1 asserts that for $(\alpha, \beta) \in (0, 1)^2$ the support and the sign of
most vectors obeying (6) can be recovered using the Lasso if the non-zero coefficients of
$x_0$ are large enough compared to noise. This bound on the sparsity of $x_0$ turns out to be
optimal, since for any $c > 1$, for most vectors $x_0$ such that $\|x_0\|_0 \ge \frac{c\,n}{2 \log p}$, the support
cannot be recovered using the Lasso even with no noise. Indeed, [33] and [58] proved that
the Lasso solution for any $\gamma$ shares the same sign and the same support as $x_0$ when $y = A x_0$
if and only if

$\max_{j \notin I} |\langle a_j, A_I (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0) \rangle| \le 1$.

Note in passing the difference with the strict inequality in (3). On the other hand, if
$\|x_0\|_0 \ge \frac{c\,n}{2 \log p}$, $c > 1$, then w.o.p. $\|A_I (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0)\|_2^2 > \frac{C\,n}{2 \log p}$ for some $C > 1$

and sufficiently large $p$. As a result, $\max_{j \notin I} |\langle a_j, A_I (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0) \rangle| > \sqrt{C} > 1$. This
informal optimality discussion is consistent with the information-theoretic bounds of [47],
where it was proved that the number of measurements required by the Lasso achieves
the (asymptotic) information-theoretic necessary bound that has the scaling (6) when the
sparsity regime is sub-linear and $T^2 \sim 1/\|x_0\|_0$.

An important feature of Theorem 1 is that all the constants are made explicit and are
governed by the two numerical constants $\alpha$ and $\beta$. The role of $\alpha$ is very instructive since
when lowering $\gamma$ by decreasing $\alpha$, the threshold on the minimal SNR is decreased to allow
smaller coefficients to be recovered, but simultaneously the probability of success gets lower
and the number of measurements required to recover the $k$-sparse signal increases. The
converse applies when $\alpha$ is increased. On the other hand, increasing $\beta$ (in an appropriate
range; see Section 3.3 for details) allows a higher threshold on the sparsity level, but again
at the price of a smaller probability of success.
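
To get a feel for the trade-offs governed by $\alpha$ and $\beta$, the following sketch evaluates the quantities of Theorem 1 for given dimensions. It assumes the sparsity bound reads $k \le \alpha\beta n / (2\log p)$, that the regularization parameter is $\gamma = \frac{\epsilon}{\sqrt{1-\alpha}}\sqrt{2\log p / n}$, and that the minimal-amplitude threshold is $5.5\,\gamma$; these forms are read from the (partly damaged) statements in this section, and the function name is hypothetical.

```python
import math


def theorem1_quantities(n, p, alpha, beta, eps):
    """Evaluate the (assumed) sparsity bound, regularization parameter and amplitude threshold."""
    # Common factor: sqrt(2 log p / n) / sqrt(1 - alpha).
    lam = math.sqrt(2.0 * math.log(p) / n) / math.sqrt(1.0 - alpha)
    return {
        "k_max": alpha * beta * n / (2.0 * math.log(p)),  # upper bound on the sparsity level
        "gamma": lam * eps,                               # Lasso regularization parameter
        "T_min": 5.5 * lam * eps,                         # lower bound on the smallest non-zero amplitude
    }


q = theorem1_quantities(n=1000, p=10000, alpha=0.75, beta=0.5, eps=1.0)
```

Decreasing `alpha` in this sketch lowers both `gamma` and `T_min` but also lowers `k_max`, which mirrors the discussion above.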

2.2. Support Recovery with Compressible Signals 

Theorem 1 can be easily extended to weakly sparse or compressible signals. We consider
the best $k$-term approximation $x^k$ of $x_0$ obtained by keeping only the $k$ largest entries from
$x_0$ and setting the others to zero. Obviously, $k = |I(x^k)|$. This is equivalently defined
using a thresholding

$x^k[i] = x_0[i]$ if $|x_0[i]| \ge T$, and $x^k[i] = 0$ otherwise. (9)
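
The best $k$-term approximation described above can be sketched in a few lines of Python. The function name `best_k_term` is illustrative, and ties between equal magnitudes are broken by keeping the lowest index, a convention the text does not specify.

```python
def best_k_term(x0, k):
    """Keep the k largest-magnitude entries of x0 and set the others to zero."""
    # Indices sorted by decreasing magnitude (sorted() is stable, so ties keep the lowest index).
    order = sorted(range(len(x0)), key=lambda i: -abs(x0[i]))
    keep = set(order[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x0)]


x0 = [0.1, -4.0, 0.3, 2.0, -0.05]
xk = best_k_term(x0, 2)
residual = [a - b for a, b in zip(x0, xk)]   # the compressibility error x0 - x^k
```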
A signal is generally considered as compressible if the residual $x^k - x_0$ is small. For
sparsistency to make sense in this compressible case, additional assumptions are required,
namely that the largest components of the signal are significantly larger than the residual
$x^k - x_0$. This is made formal in the following theorem.

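Theorem 2. Let $A$, $\alpha$, $\beta$ and $p$ be as in Theorem 1. We measure $y = A x_0 + w$, and let $x^k$
be the best $k$-term approximation of $x_0$ where $k$ satisfies (6). We denote

$\lambda = \frac{1}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}}$.

Suppose that $T$ as defined in (9) is such that

$\|w\|_2 + 4 \|x_0 - x^k\|_2 \le \epsilon$, (10)

$T \ge 5.5\,\lambda\epsilon$, (11)

and

$\|x_0 - x^k\|_\infty \le \frac{[\ldots]}{5} (1 - \sqrt{\beta})\,\lambda\epsilon$. (12)

Then, with probability $P_2(n, p, \alpha, \beta)$ converging to 1 as $n$ goes to infinity, the solution $x(\gamma)$
of the Lasso from measurements $y$ with

$\gamma = \lambda\epsilon$ (13)

is unique and satisfies

$\mathrm{supp}(x(\gamma)) = \mathrm{supp}(x^k)$ and $\mathrm{sign}(x(\gamma)) = \mathrm{sign}(x^k)$.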

Again, all the leading constants are explicit. Conditions (11) and (12) impose
compressibility constraints on the signal, namely that the magnitudes of the $k$ largest components of
$x_0$ are well above the average magnitude $\epsilon/\sqrt{n}$ of the residual, and that the latter is "flat",
since the ratio of its $\ell_\infty$ and $\ell_2$ norms should be small.

The proof (see Section 4) provides an explicit bound for $P_2(n, p, \alpha, \beta)$, showing that
$P_2(n, p, \alpha, \beta)$ is greater than

1 - le-o-^v^ - , ^ - ( - o(e-°-^vT^) , 
2 2Vn^ VogpJ ^ ' 

although once again this bound on the probability is far from optimal. 

Theorem 2 encompasses the strictly sparse case, Theorem 1, which is easily recovered
by letting $x_0 = x^k$. The parameter $\alpha$ plays a similar role in both theorems. Furthermore,
in Theorem 2, the Lasso solution becomes more tolerant to compressibility errors $x_0 - x^k$ as
$\alpha$ decreases. This however comes at the price of a lower probability of success as indicated
in our proof.
2.3. Partial Support Recovery with Strictly Sparse Signals

In both previous theorems, the assumption on $T$ plays a pivotal role: if $T$ is too small,
there is no way to distinguish the small components of $x_0$ from the noise; see also the
discussion and literature review in Section 1.3. If no assumptions are made
on $T$, one can nevertheless expect to partly recover the support of $x_0$. This is formalized
in the following result.

Theorem 3. Let $A$, $\alpha$ and $\beta$ be as in Theorem 1. We measure $y = A x_0 + w$, where $x_0$ fulfills
(6). Then with probability $P_3(n, p, \alpha, \beta)$ converging to 1 as $n$ goes to infinity, the solution
$x(\gamma)$ of the Lasso from measurements $y$ with

$\gamma = \frac{\epsilon}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}}$

is unique and satisfies

$\mathrm{supp}(x(\gamma)) \subseteq \mathrm{supp}(x_0)$.
Moreover, the Lasso solution is $\ell_2$-consistent:

lko-x(7)||2< + e. (14) 

The proof in Section 5 provides an explicit lower bound for $P_3(n, p, \alpha, \beta)$, and shows
that $P_3(n, p, \alpha, \beta)$ is larger than

"(l-v^-\/|)' 1 

1 - e 2 -^=^ . 

2V7r logp 

As before, this bound on the probability is not optimal. 



If $\gamma$ is large enough it is clear that $\mathrm{supp}(x(\gamma)) \subseteq \mathrm{supp}(x_0)$ since for $\gamma \ge \|A^T y\|_\infty$,
$x(\gamma) = 0$. Theorem 3 provides a parameter $\gamma$ proportional to $\epsilon$ that ensures a partial
support recovery without any assumption on $T$. It also gives a sharp upper bound on the
$\ell_2$-error of the Lasso solution. This result remains valid under the additional hypotheses
of Theorem 1 or 2 allowing exact recovery of the support.

2.4. Connections to related works 



Sparsistency. As we mentioned in Section 1.3, our work is closely related to [22, 27], but
is different in many important ways that we summarize as follows.

• Deterministic vs random measurement matrices: the work of [22] considers
deterministic matrices satisfying a weak incoherence condition. Our work focuses on the
classical Gaussian ensemble.

• Asymptotic vs non-asymptotic analysis: the analysis in [27] applies to a high-dimensional
setting where even the sparsity level $k$ grows with the number of measurements $n$.
As a result, $k$ appears in the statements of the probabilities, which thus requires
that $k \to +\infty$. This is very different from our setting as well as that of [22], where
the probabilities depend solely on the dimensions of $A$. We believe that this is more
natural in many applications.

• Random vs deterministic noise: in both previous works, the noise is stochastic
(Gaussian in [22] and sub-Gaussian in [27]). In our work, we handle any noise with a finite
$\ell_2$-norm.

• Leading numerical constants: these are not always explicit and sharp in those works.
The constant involved in the sparsity level upper-bound in [22, Theorem 1.3] is not
given, whereas (6) gives an explicit and sharp bound. The bounds (7) and (8) on $T$
and $\gamma$ are similar to those given in [22, Theorem 1.3] once specialized for $\alpha = 3/4$.
In [27, Theorem 2], the constant appearing in the lower-bound on $T$ is not given,
whereas (7) provides an explicit expression that is shown to be reasonably good in
Section 6.

• Compressible signals: to the best of our knowledge, the compressible case has not
been covered in the literature, and Theorem 2 thus appears as a distinctively novel
result of this paper.

• $\ell_2$-consistency: such a result is not given in those references. A bound on the $\ell_2$-
prediction error on $A x_0 - A x(\gamma)$ is proved in [22]. An $\ell_\infty$-consistency is established
in [27], which is an immediate consequence of sparsistency. Our method of proof
differs significantly from the one used in [27], and in particular it naturally leads to
the $\ell_2$-consistency result.



• Exact and partial support recovery: in [22] the partial recovery case was not
considered. In [27], exact and partial recovery are somewhat handled simultaneously,
while we give two distinct results for each case.

$\ell_2$-consistency. This property of the Lasso estimate has been widely studied by many
authors under various sufficient conditions. Theorem 3 may then be compared to this
literature, and we here focus on results based on the restricted isometry property (RIP)
[59] and more or less similar variants in the literature; see the discussion in [33] and the
review in [35].

The RIP results are uniform and ensure $\ell_2$-stability of the Lasso estimate for all
sufficiently sparse vectors from noisy measurements, whereas Theorem 3 guarantees that the
Lasso estimate is $\ell_2$-consistent for most sparse vectors and a given matrix. When $A$ is
Gaussian, the scaling of the sparsity bound is $O(n / \log(p/n))$ for RIP-based results, which
is better than $O(n / \log p)$ in Theorem 3. Note that the scaling $O(n)$ was derived in [60]
when $A$ belongs to the uniform spherical ensemble to ensure $\ell_2$-stability of the Lasso
estimate for most matrices $A$, although the leading constants are not given explicitly. However,
the RIP is a worst-case analysis, and the price is that the leading constants in the sufficient
sparsity bounds are overly small. In contrast, the leading numerical constants in our
sparsity and $\ell_2$-consistency upper-bounds are explicit and solely controlled by $(\alpha, \beta) \in (0, 1)^2$.
For instance, it can be verified from our proof that the value of the sparsity upper-bound
we provide is actually larger than the bounds obtained from the RIP for p up to e^"^.
Finally, the RIP is a deterministic property that turns out to be satisfied by many
ensembles of random matrices other than the Gaussian. Our Theorem 3 could presumably be
extended to sub-Gaussian matrices (e.g. using [61, Corollary V.2.1]), but this needs further
investigation that we leave for future work.

3. Proof of Support Identification of Exactly Sparse Signals 

This section gives the proof of Theorem 1. Recall that $\bar{x}$ is the restriction of $x$ to its
support $I(x)$, and $A_I$ the corresponding sub-matrix. We also denote the Moore-Penrose
pseudo-inverse of $A_I$ as

$A_I^+ = (A_I^T A_I)^{-1} A_I^T$.

3.1. Optimality Conditions for Penalized Minimization 

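From classical convex analysis, the first order optimality conditions show that a vector
$x^\star$ is a solution of the Lasso if and only if

$A_I^T (y - A x^\star) = \gamma\,\mathrm{sign}(\bar{x}^\star)$,
$\forall j \notin I$, $|\langle a_j, y - A x^\star \rangle| \le \gamma$, (15)

where $I = I(x^\star)$.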
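
These optimality conditions translate directly into a numerical check. The sketch below, assuming NumPy, tests whether a candidate vector satisfies both conditions up to a tolerance; the function name `satisfies_lasso_kkt` and the tolerance `tol` are illustrative assumptions.

```python
import numpy as np


def satisfies_lasso_kkt(A, y, x, gamma, tol=1e-8):
    """Check the first order optimality conditions of the Lasso at x."""
    r = A.T @ (y - A @ x)                 # correlations <a_j, y - A x> for every column j
    I = np.flatnonzero(x)                 # support of the candidate
    off = np.setdiff1d(np.arange(x.size), I)
    # On the support: correlation equals gamma times the sign of the coefficient.
    on_support = np.allclose(r[I], gamma * np.sign(x[I]), atol=tol)
    # Off the support: correlation bounded by gamma in magnitude.
    off_support = np.all(np.abs(r[off]) <= gamma + tol)
    return bool(on_support and off_support)


A_id = np.eye(3)
y_demo = np.array([3.0, -0.5, 1.2])
ok = satisfies_lasso_kkt(A_id, y_demo, np.array([2.0, 0.0, 0.2]), gamma=1.0)
```

With the identity matrix, the soft-thresholded vector passes the check while the unpenalized least-squares solution does not.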

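Hence if the goal pursued is to ensure that $I(x^\star) = I(x_0) = I$ and $\mathrm{sign}(\bar{x}^\star) = \mathrm{sign}(\bar{x}_0)$,
the only candidate solution of the Lasso is

$\bar{x}^\star = \bar{x}_0 - \gamma (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0) + A_I^+ w$. (16)

Consequently, the vector $x^\star$ defined by (16) is a solution of the Lasso if and only if the two
following conditions are met:

$\mathrm{sign}(\bar{x}_0) = \mathrm{sign}(\bar{x}^\star)$, (C1)
$\forall j \notin I(x_0)$, $|\langle a_j, \gamma\,d(x_0) + P_{V_I^\perp}(w) \rangle| \le \gamma$, (C2)

where $V_I = \mathrm{Span}(A_I)$, $P_{V_I^\perp}$ is the orthogonal projection on the subspace orthogonal to $V_I$,
and $d(x_0)$ is defined in (4).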
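
The candidate solution and the two conditions can be evaluated numerically as follows. This is a sketch assuming NumPy, with a hypothetical function name; it uses the fact that $A_I^+ y = \bar{x}_0 + A_I^+ w$ when $y = A x_0 + w$, so the candidate can be computed from $y$ and the sign pattern alone, and that the residual $y - A x^\star$ equals $\gamma\,d(x_0) + P_{V_I^\perp}(w)$.

```python
import numpy as np


def candidate_lasso_solution(A, y, x0, gamma):
    """Return (x_star, c1, c2): the candidate built on the support and signs of x0,
    and whether the sign condition (C1) and the correlation condition (C2) hold."""
    I = np.flatnonzero(x0)
    off = np.setdiff1d(np.arange(A.shape[1]), I)
    A_I = A[:, I]
    s = np.sign(x0[I])
    # Restriction to I: (A_I^T A_I)^{-1} (A_I^T y - gamma * s) = A_I^+ y - gamma (A_I^T A_I)^{-1} s.
    xbar = np.linalg.solve(A_I.T @ A_I, A_I.T @ y - gamma * s)
    x_star = np.zeros(A.shape[1])
    x_star[I] = xbar
    c1 = bool(np.all(np.sign(xbar) == s))
    # The residual y - A x_star plays the role of gamma * d(x0) + P_{V_I^perp}(w).
    c2 = bool(np.all(np.abs(A[:, off].T @ (y - A @ x_star)) <= gamma))
    return x_star, c1, c2


x_star, c1, c2 = candidate_lasso_solution(
    np.eye(3), np.array([3.0, -0.5, 1.2]), np.array([3.0, 0.0, 1.0]), gamma=1.0
)
```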

Sections 3.2 and 3.3 show that under the hypotheses of Theorem 1, conditions (C1)
and (C2) are in force with probability converging to 1 as $n$ goes to infinity. This will thus
conclude the proof of Theorem 1.

3.2. Condition (C1)

To ensure that $\mathrm{sign}(\bar{x}_0) = \mathrm{sign}(\bar{x}^\star)$, it is sufficient that

$\|-\gamma (A_I^T A_I)^{-1} \mathrm{sign}(\bar{x}_0) + A_I^+ w\|_\infty < T$. (17)

We prove that this is indeed the case w.o.p.

Lemma IH whose proof is given in Appendix IA.3[ shows that 7 = ^ \f^^^ — WE 
implies 

iWiAjAi) 'sign(xo)|L< 55 

with probability greater than 1 — kp — 2e "logp 

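To prove (17), we will now bound $\|A_I^+ w\|_\infty$. To this end, we split it as follows

$\|A_I^+ w\|_\infty = D_1 \times D_2 \times D_3 \times \|w\|_2$,

where

$D_1 = \frac{\|A_I^+ w\|_\infty}{\|A_I^+ w\|_2}$, $D_2 = \frac{\|A_I^+ w\|_2}{\|A_I^T w\|_2}$, $D_3 = \frac{\|A_I^T w\|_2}{\|w\|_2}$.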
Bounding $D_1$. As $A$ and $w$ are independent, Lemma 5, proved in Appendix A.4, shows
that the distribution of $A_I^+ w$ is invariant under orthogonal transforms on $\mathbb{R}^k$. Therefore
the random variable

$\frac{A_I^+ w}{\|A_I^+ w\|_2}$

is uniformly distributed on the unit $\ell_2$ sphere of $\mathbb{R}^k$.


Using the concentration Lemma [71 detailed in Appendix IB} with e = ( ^ ^ ) ^ , it 



1 



> 1 - max (^4n-t,8e-V2iog(2")j . (is) 

One can notice that Di < 1 actually gives a better bound if k is small compared to n. 
Moreover the bound on the probability is 1 — 4n~3 for k big. 

Bounding $D_2$. $D_2$ is bounded by the largest eigenvalue of $(A_I^{\mathrm T} A_I)^{-1}$. Indeed,
owing to Lemma [3] with i = 1 — a M — 2~s , we arrive at 



1-2"^-- 



1 , \2 



P (L>2 < 24 ) > 1 - e 2V VT^J . (19) 

Bounding $D_3$. Let's write

    $\|A_I^{\mathrm T} w\|_2^2 = \sum_{i \in I} \langle a_i, w \rangle^2.$

Since each $\langle a_i, w \rangle$ is a zero-mean Gaussian variable with variance $\|w\|_2^2 / n$, the variable

    $\frac{n \|A_I^{\mathrm T} w\|_2^2}{\|w\|_2^2}$

follows a $\chi^2$ distribution with $k$ degrees of freedom. Therefore, in virtue of the concentration
Lemma 8 stated in Appendix B applied with



1 + 5 = 2^ 

we obtain 



log n 
log k 



P (dI < ^^V^og^^ > 1 _ _J_,-K^-'^-'^--M^)) > 1 - le-0-7v^ 
V ~ nVlogI J ~ ~ 2 

This last bound may be pessimistic; when k is large this probability is actually much bigger. 
This shows that w.o.p. , 

[2k /logn\ 4 , , 

Combining (18), (19) and (20), we conclude that



l^>L S (21) 
with probability greater than

;^_lg-o.7v/k5i7I_g-f(i-2-H-^^) _n,ax(4n-5,8e-V2i°g(2")^_^p-i.28_2e- ""'Vio^'^'' 

which converges to 1 as $n \to +\infty$.

In turn, the bound (I2ip becomes, under assumption ([7} on T, 



II / lloo - 5^5 

This shows that condition (C1) is in force with probability converging to 1 as $n \to +\infty$.

3.3. Condition (C2)

Let's introduce the following vector

    $u = \gamma d(x_0) + P_{V_I^\perp}(w),$   (22)

which depends on both $x_0$ and $w$.

Clearly, to comply with (C2), we need to bound $(\langle a_j, u \rangle)_{j \notin I}$ w.o.p. We will start by
bounding $\|u\|_2$.

Bounding $\|u\|_2$. As $d(x_0) \in V_I$, the Pythagorean theorem yields

    $\|u\|_2^2 = \gamma^2 \|d(x_0)\|_2^2 + \|P_{V_I^\perp}(w)\|_2^2.$   (23)



Let $S = \operatorname{sign}(\bar{x}_0)$. Then

    $\frac{nk}{\|d(x_0)\|_2^2} = \frac{n \|S\|_2^2}{S^{\mathrm T} (A_I^{\mathrm T} A_I)^{-1} S}.$

Since $x_0$ and $A$ are independent, Lemma 6, stated in Appendix A.5, shows that $nk / \|d(x_0)\|_2^2$ is
$\chi^2$-distributed with $n - k + 1$ degrees of freedom. Thanks to Lemma 9, see Appendix B, it
follows that for all $\delta > 0$,

nk , ,, ,,,o\ (n-fc + l)log(l- 



p 



Since ^ < 2T3^) '^^ obtain for p > eas, 

p(fc<||d(xo)||^(l-5)2) <e^'^ 



/ II ,/ Ml2 ^ ^\ . n(3-v/?)log/3 

P ||(i(xo)||2 < 1 > 1 - e ^ 



2 - ^ 
2-/3 



Choosing 5 such that (1 — (5) > -y/^, we have 

1 11" V—u/ 1I2 — ^ 

This shows that 

l|dMII^< 

with probability converging to 1 as $n \to +\infty$.

1 

It is worthy to mention that the condition p > 6^(1-^^) actually guarantees the existence 
of a suitable b. 

As Py^i- is an orthogonal projector, we have 
), this shows that 

,,o n/c a\ n(3-V?) log /3 



< 11^^112 — ^- Together with 
Bounding $\max_{j \notin I} |\langle u, a_j \rangle|$. For a fixed $u$, the random variables $(\langle a_j, u \rangle)_{j \notin I}$ are zero-mean
Gaussian variables with variance $\|u\|_2^2 / n$.

Using the bound (24), traditional arguments from the concentration of the maximum
of Gaussian variables tell us that



    $\max_{j \notin I} |\langle a_j, u \rangle| \le \sqrt{\frac{2 \log p}{n} \left( \gamma^2 \frac{k}{\beta} + \varepsilon^2 \right)}$   (25)



with a probability larger than 



n(3-V?)log/3 1 

1 — e 16 



2^/^T]ogp 

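In turn, this implies that condition (C2) is in force w.o.p. if

    $\sqrt{\frac{2 \log p}{n} \left( \gamma^2 \frac{k}{\beta} + \varepsilon^2 \right)} \le \gamma.$

Since $k \le \frac{\alpha \beta n}{2 \log p}$, this holds if

    $\frac{\varepsilon}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}} \le \gamma.$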

This concludes the proof of Theorem 1 and shows that overall

na(0. 75^2-1)^ n(3-^) log /3 



P(n,p,Q,/3) > l--e-°-^^^-e ^ ^ ^^^^i^i - max (4n~5 , 8e-V2iog(2n) 



/cp ^'^^ — 2e — e 16 



2^/^^\ogp 



4. Proof of Support Identification of Compressible Signals 

To prove this theorem, we capitalize on the results of Section 3.1 by noting that $y =
A x_k + A(x_0 - x_k) + w := A x_k + A h + w$, and replacing $x_0$ by $x_k$ and $w$ by $w_2 = A h + w$.
With this change of variables, it is then sufficient to check conditions (C1) and (C2), with
the notable difference that the noise $w_2$ is not independent of $A$ anymore. More precisely,
$w_2$ is independent of $(a_i)_{i \in I}$ but not of $(a_j)_{j \notin I}$.

Condition (Ci). Since this condition only depends on A/, it is verified with probability 
converging to 1 as n — t- +00, as in the proof of Theorem [H provided that T > 5.67 and 

||ii'2||2 ^ ^5k^p"- '^^^ ^'-^^ condition is a direct consequence of assumptions (llip and 
(13). Moreover, $\|w_2\|_2 \le \|w\|_2 + \|Ah\|_2$, where $Ah$ is a zero-mean Gaussian vector, whose
entries are independent with variance $\|h\|_2^2 / n$. Therefore $n \|Ah\|_2^2 / \|h\|_2^2$ has a $\chi^2$ distribution with $n$
degrees of freedom. We then derive from the concentration Lemma 8 that

    $P(\|Ah\|_2 \le 2 \|h\|_2) \ge 1 - \frac{1}{3\sqrt{2\pi n}} e^{-0.8 n}.$



Under assumptions (10)-(11), the last inequality implies that

||w^2|l2 ^ Il'"^ll2 + ^ 11^112 ^ ^ ^ 
with probability that tends to 1 as $n \to +\infty$. Condition (C1) is thus satisfied with a
probability larger than



nQ(0. 75^-1)^ X 

3\/27rn 



2e 4logP ; p-0.8n 



Condition (C2). For any $j \notin I$, define the vector $v_j = w_2 - h[j] a_j$. In particular, $v_j$ is
independent of $a_j$. Condition (C2) now reads:



where the vector d{x^) is defined replacing xq by x'^ in 
Similarly to (24), it can be shown that w.o.p.



2 



2'*' , II ||2 



2 p ^ 



On the other hand, $\|v_j\|_2 \le \|w_2\|_2 + \|h\|_\infty \|a_j\|_2$, and $n \|a_j\|_2^2$ is $\chi^2$-distributed with $n$
degrees of freedom. Applying Lemma 8 to bound $\|a_j\|_2$ by 2 for all $j$ and using similar
arguments to those leading to (25), we get



m^^\{a„^d{x') + Py^.iv,))\ < (7^1 + (lkll2 + 4 11/^112)^ 

with probability larger than 1 p+^^-OSn _ 1 converging to 1 as n — )• +00. It 

o 3V27r?i 2^71' log p' ^ " 



then follows from assumptions (10) and (13) that w.o.p.



max|(a„7d(x^) + Pv^,x(7;,-))| < + • (26) 



As an orthogonal projector is a self-adjoint idempotent operator, we have for all $j \notin I$,

    $|h[j] \langle a_j, P_{V_I^\perp}(a_j) \rangle| \le \|h\|_\infty \|P_{V_I^\perp}(a_j)\|_2^2,$

where $\|P_{V_I^\perp}(a_j)\|_2^2$ is the squared $\ell_2$-norm of the projection of a Gaussian vector on the
subspace $V_I^\perp$ whose dimension is $n - k$. As $V_I^\perp$ is independent of $a_j$, for $j \notin I$, $n \|P_{V_I^\perp}(a_j)\|_2^2$
follows a $\chi^2$ distribution with $n - k$ degrees of freedom. Using Lemma 8 together with
assumptions (12)-(13), the following bound holds w.o.p.

max\h[j]{aj , Py^±{aj)) \ < 2.5 



oo<|(l-V^) 



(27) 



In summary, (26) and (27) show that (C2) is fulfilled with probability larger than



1 .g-0.8n 



3v 27rn 



1 .g-0.3n 



3v 27rn 



i/27r(ji-A:) 



-0.009n 
5. Proof of Partial Support Recovery

To prove the first part of Theorem 3, we need to show that w.o.p., the extension
$x_1(\gamma)$ on $\mathbb{R}^p$ of the solution of

    $\min_{x \in \mathbb{R}^{|I|}} \frac{1}{2} \|y_1 - A_I x\|_2^2 + \gamma \|x\|_1,$   (28)

with $y_1 = P_{V_I}(y)$, is the solution of the Lasso. By definition, the support $J$ of this extension
is included in $I$.

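Proving this assertion amounts to showing that $x_1(\gamma)$ fulfills the necessary and sufficient
optimality conditions

    $A_J^{\mathrm T}(y - A x_1(\gamma)) = \gamma \operatorname{sign}(\bar{x}_1(\gamma)),$
    $\forall l \notin J, \quad |\langle a_l, y - A x_1(\gamma) \rangle| \le \gamma.$   (29)

Since $y_1 = P_{V_I}(y)$ and $J \subset I$, $A_J^{\mathrm T}(y - A x_1(\gamma)) = A_J^{\mathrm T}(y_1 - A x_1(\gamma))$. In addition, as $x_1(\gamma)$
is the extension of the solution of (28), the optimality conditions associated to (28) yield

    $A_J^{\mathrm T}(y - A x_1(\gamma)) = \gamma \operatorname{sign}(\bar{x}_1(\gamma)),$
    $\forall l \in I \cap J^c, \quad |\langle a_l, y - A x_1(\gamma) \rangle| \le \gamma.$

To complete the proof, it remains now to show that w.o.p.

    $\forall l \notin I, \quad |\langle a_l, y - A x_1(\gamma) \rangle| \le \gamma.$   (30)

As in the proofs of Theorems 1 and 2, to bound these scalar products, the key argument
is the independence between the vectors $(a_l)_{l \notin I}$ and the residual vector $y - A x_1(\gamma)$.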

We first need the following intermediate lemma. 

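Lemma 1. Let $A \in \mathbb{R}^{n \times k}$ such that $(A^{\mathrm T} A)$ is invertible. Take $x(\gamma)$ as a solution of the
Lasso from observations $y \in \mathbb{R}^n$. The mapping $f : \gamma \in \mathbb{R}^{+*} \mapsto f(\gamma) = \|y - A x(\gamma)\|_2 / \gamma$ is
well-defined and non-increasing.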

Proof: The authors in Q and (ssl independently proved that under the assumptions 
of the lemma: 

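• the solution $x(\gamma)$ of the Lasso is unique;

• there is a finite increasing sequence $(\gamma_t)_{t \le K}$ with $\gamma_0 = 0$ and $\gamma_K = \|A^{\mathrm T} y\|_\infty$ such
that for all $t < K$, the sign and the support of $x(\gamma)$ are constant on each interval
$(\gamma_t, \gamma_{t+1})$;

• $x(\gamma)$ is a continuous function of $\gamma$.

Moreover $x(\gamma)$ with support $J$ satisfies

    $\bar{x}(\gamma) = A_J^+ y - \gamma (A_J^{\mathrm T} A_J)^{-1} \operatorname{sign}(\bar{x}(\gamma)),$   (31)

which implies that

    $r(\gamma) := y - A x(\gamma) = P_{V_J^\perp}(y) + \gamma A_J (A_J^{\mathrm T} A_J)^{-1} \operatorname{sign}(\bar{x}(\gamma)).$

Therefore, on each interval $(\gamma_t, \gamma_{t+1})$, $r(\gamma)$ is an affine function of $\gamma$ which can be written

    $r(\gamma) = z + \gamma v,$

where $z := P_{V_J^\perp}(y)$ and $v := A_J (A_J^{\mathrm T} A_J)^{-1} \operatorname{sign}(\bar{x}(\gamma))$. As $v \in V_J$ and $z \in V_J^\perp$, the
Pythagorean theorem allows to write for $\gamma \in (\gamma_t, \gamma_{t+1})$ that

    $\frac{\|r(\gamma)\|_2^2}{\gamma^2} = \frac{\|z\|_2^2}{\gamma^2} + \|v\|_2^2.$   (32)

We then deduce that $f(\gamma) = \|r(\gamma)\|_2 / \gamma$ is a non-increasing function of $\gamma$ on each interval
$(\gamma_t, \gamma_{t+1})$. By continuity of $f$, it follows that $f$ is non-increasing on $\mathbb{R}^{+*}$.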
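
The monotonicity stated in Lemma 1 can be observed on the simplest design satisfying its assumption, $A = \mathrm{Id}$, for which the Lasso solution is the entrywise soft-thresholding of $y$ (a standard fact, used here as an assumption of the sketch rather than something taken from the paper). The observation vector and the grid of $\gamma$ values below are illustrative, and the function names are hypothetical.

```python
import math

def soft_threshold(y, gamma):
    """Lasso solution when A is the identity: entrywise soft-thresholding."""
    return [math.copysign(max(abs(v) - gamma, 0.0), v) for v in y]

def residual_ratio(y, gamma):
    """f(gamma) = ||y - x(gamma)||_2 / gamma for the identity design."""
    x = soft_threshold(y, gamma)
    return math.sqrt(sum((v - xv) ** 2 for v, xv in zip(y, x))) / gamma

y = [3.0, -1.5, 0.4, 0.0, -0.2]            # illustrative observations
gammas = [0.05 * t for t in range(1, 81)]  # gamma from 0.05 to 4.0
ratios = [residual_ratio(y, g) for g in gammas]

# f should never increase along the grid (up to rounding errors).
is_non_increasing = all(b <= a + 1e-12 for a, b in zip(ratios, ratios[1:]))
```

In this special case each residual entry is $y_i$ clipped to $[-\gamma, \gamma]$, so $f(\gamma)^2 = \sum_i \min(|y_i| / \gamma, 1)^2$, which makes the monotonicity visible term by term.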



Remark 1. If $(A^{\mathrm T} A)$ is not invertible, the Lasso may have several solutions. Nevertheless
$r(\gamma)$ is always uniquely defined and the lemma should also apply.



From Lemma 1, we deduce that $\|y_1 - A x_1(\gamma)\|_2 / \gamma$ is a non-increasing function of $\gamma$. Because
$y_1 \in V_I$ and $A_I$ has full column-rank, we also have

    $\lim_{\gamma \to 0} x_1(\gamma) = x_1,$

where on $I$, the entries of $x_1$ are those of the unique vector $x$ of $\mathbb{R}^{|I|}$ such that $A_I x = y_1$.
Therefore,

    $x_1[i] = x_0[i] + (A_I^+ w)[i], \quad \text{for } i \in I.$   (33)

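Since $A_I$ is Gaussian and independent from $x_0$ and $w$, the support of $x_1$ is almost surely
equal to $I$. Hence there exists $\gamma_1 > 0$ such that if $\gamma < \gamma_1$, the support and the sign of $x_1(\gamma)$
are equal to those of $x_1$. More precisely, if $\gamma < \gamma_1$, $x_1(\gamma)$ satisfies

    $\bar{x}_1(\gamma) = \bar{x}_1 - \gamma (A_I^{\mathrm T} A_I)^{-1} \operatorname{sign}(\bar{x}_1)$ and $r(\gamma) := y_1 - A x_1(\gamma) = \gamma A_I (A_I^{\mathrm T} A_I)^{-1} \operatorname{sign}(\bar{x}_1).$

It then follows that for $\gamma \in (0, \gamma_1)$,

    $\frac{\|y_1 - A x_1(\gamma)\|_2}{\gamma} = \|A_I (A_I^{\mathrm T} A_I)^{-1} \operatorname{sign}(\bar{x}_1)\|_2.$

Now, since

    $\|A_I (A_I^{\mathrm T} A_I)^{-1} \operatorname{sign}(\bar{x}_1)\|_2^2 = \langle (A_I^{\mathrm T} A_I)^{-1} \operatorname{sign}(\bar{x}_1), \operatorname{sign}(\bar{x}_1) \rangle,$

we deduce that for all $\gamma > 0$,

    $\frac{\|y_1 - A x_1(\gamma)\|_2}{\gamma} \le \sqrt{|I| \, \rho((A_I^{\mathrm T} A_I)^{-1})},$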



where p((^j^/) ^) is the spectral radius of (A J^/) ^. Using Lemma[3]witli /3 < - 
then leads to 

^/to-Ax.Mii,^ /i\ ^^^^ 
By the Pythagorean theorem and the fact that $\|P_{V_I^\perp}(w)\|_2 \le \varepsilon$, we have

    $\|y - A x_1(\gamma)\|_2^2 = \|y - y_1\|_2^2 + \|y_1 - A x_1(\gamma)\|_2^2$
    $\qquad = \|P_{V_I^\perp}(w)\|_2^2 + \|y_1 - A x_1(\gamma)\|_2^2$
    $\qquad \le \varepsilon^2 + \|y_1 - A x_1(\gamma)\|_2^2.$

With similar arguments as those leading to (25), it can then be deduced that



max I (a/, y - Axi{j))\ < 
If I 



' 2 log p 



n 



with probability larger than 1 — e 



1 



2v7rlogp' 



If < ^ and 7 > 



2 logp 



then 



21ogp( + 



/3 



(35) 



< 7, and therefore inequality 



a V " 

(30) is satisfied w.o.p. This ends the proof of the first part of the theorem.



Let's now turn to the proof of (14). To prove this inequality we notice that for large
$\gamma$, the Lasso solution $x(\gamma)$ is also the extension of the solution of (28) w.o.p. and we use
the Lipschitz property of the mapping $\gamma \mapsto x_1(\gamma)$.

Indeed, by the triangle inequality,

    $\|x_0 - x_1(\gamma)\|_2 \le \|x_0 - x_1\|_2 + \|x_1 - x_1(\gamma)\|_2.$

Recalling from (33) that $\bar{x}_1 - \bar{x}_0 = A_I^+ w$, it follows that



\\xo - xiW^ < pHAJAi)--^), 
which, using again Lemma [3l leads to the bound 

\\xo — xi II2 < 2s 

-f (0.5- 

witli probability larger than 1 — e V 

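For all $\gamma > 0$, $x_1(\gamma)$ obeys (31), and since $\lim_{\gamma \to 0} x_1(\gamma) = x_1$, we get that

    $\|x_1 - x_1(\gamma)\|_2 \le \gamma \max_{J \subset I,\ S \in \{-1,1\}^{|J|}} \|(A_J^{\mathrm T} A_J)^{-1} S\|_2.$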



(36) 



(37) 



For all $J \subset I$, the inclusion principle tells us that $\rho((A_J^{\mathrm T} A_J)^{-1}) \le \rho((A_I^{\mathrm T} A_I)^{-1})$. Furthermore,
for all $S \in \{-1, 1\}^{|J|}$, $\|S\|_2 \le \sqrt{k}$. Using Lemma 3 once again implies that



P ||2;i-xi(7)||2<7i 



> 



If 7 



/l-a 



^^2SP and k < , then w.o.p. 

n — 2 log p ' 



- a;i(7)ll2 ^ ^ 



a 



l-a 



This concludes the proof.

6. Numerical Illustrations



This section aims at providing empirical support of the sharpness of our bounds by
assessing experimentally the quality of the constants involved in Theorem 1. More specifically,
we perform a probabilistic analysis of support and sign recovery, to show that the
bounds (6), (7) and (8) are quite tight.^1

In all the numerical tests, we use problems of size $(n,p) = (8000,32000)$ and $(n,p) =
(3000,36000)$, corresponding to moderate and high redundancies. These are realistic
high-dimensional settings in agreement with signal and image processing applications. We
perform a randomized analysis, where the probability of exact recovery of supports and
signs (sparsistency) is computed by Monte-Carlo sampling with respect to a probability
distribution on the measurement matrix, on $k$-sparse signals and on the noise $w$. As detailed
in Section 1.1, the matrix $A$ is drawn from the Gaussian ensemble. We assume that the
non-zero entries $x[i]$ for $i \in I(x)$ of a vector $x \in \mathbb{R}^p$ are independent realizations of a
Bernoulli variable taking equiprobable values $\{+T, -T\}$. We also assume that the noise $w$
is drawn from the uniform distribution on the sphere $\{w \in \mathbb{R}^n \,:\, \|w\|_2 = \varepsilon\}$. Since only the
SNR matters in the bounds, we fix $\varepsilon = 1$ and only vary the value of $T$.
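
The sampling model just described can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the support is drawn uniformly at random (the text does not specify how it is chosen), the matrix entries are taken as i.i.d. $N(0, 1/n)$ in line with Lemma 3, the sizes are far smaller than those of the experiments, and the function names are hypothetical. It is not the authors' Matlab code.

```python
import math
import random

def sample_sparse_signal(p, k, T, rng):
    """Draw a k-sparse vector of R^p whose non-zero entries are +T or -T
    with equal probability; the support is uniform (an assumption)."""
    x = [0.0] * p
    for i in rng.sample(range(p), k):
        x[i] = T if rng.random() < 0.5 else -T
    return x

def sample_sphere_noise(n, eps, rng):
    """Draw w uniformly on the sphere {w in R^n : ||w||_2 = eps} by
    normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [eps * v / norm for v in g]

def sample_gaussian_matrix(n, p, rng):
    """Draw an n x p matrix with i.i.d. N(0, 1/n) entries, stored by rows."""
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(p)] for _ in range(n)]

rng = random.Random(0)  # fixed seed, for reproducibility only
x0 = sample_sparse_signal(p=200, k=10, T=3.0, rng=rng)
w = sample_sphere_noise(n=50, eps=1.0, rng=rng)
A = sample_gaussian_matrix(n=50, p=200, rng=rng)
```

Normalizing a standard Gaussian vector is used for the noise because its law is rotation invariant, which is what makes the normalized vector uniform on the sphere.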




[Figure 1: two panels, $(n,p) = (8000,32000)$ and $(n,p) = (3000,36000)$; horizontal axis $k$.]

Figure 1: Probability of sparsistency as a function of $k$ for $\alpha = 0.8$. The vertical lines correspond to our
sparsistency bound $k_\beta$, from left to right, for $\beta = 0.7, 0.8, 0.9, 1$.



Challenging the sparsity bound (6). We first evaluate, for $\alpha = 0.8$, and for a varying value
of $k$, the probability of sparsistency given that

    $T = \frac{5.5\,\varepsilon}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}}$ and $\gamma = \frac{T}{5.5},$   (38)

which are values in accordance with the bounds (7) and (8).

In order to compute numerically this probability, for each $k$, we generate 1000 sparse
signals $x_0$ with $\|x_0\|_0 = k$, and check whether conditions (C1) and (C2) defined in
Section 3.1 are satisfied. Figure 1 shows how this probability decays when $k$ increases. The
vertical lines correspond to the critical sparsity thresholds

    $k_\beta = \frac{\alpha \beta n}{2 \log p}$   (39)

as identified by the bound (6). The estimated probability exhibits a typical phase transition
that is located precisely around the critical value $k_\beta$ for $\beta$ close to one. This shows that our
bound is quite sharp. We also display the same probability curve for other, less conservative,
values of $\gamma \in \{T/4, T/2\}$, which improves slightly the probability with respect to $\gamma = T/5.5$.

^1 The Matlab code to reproduce the figures is freely available for download from
http://www.ceremade.dauphine.fr/~peyre/codes/
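
To give an order of magnitude for these thresholds, the short Python sketch below evaluates the critical sparsity for the two problem sizes and the four values of $\beta$ of Figure 1, reading (39) as $k_\beta = \alpha \beta n / (2 \log p)$ with the natural logarithm. The function name `critical_sparsity` is a hypothetical helper, not part of the authors' code.

```python
import math

def critical_sparsity(n, p, alpha, beta):
    """Threshold k_beta = alpha * beta * n / (2 log p), as read from (39)."""
    return alpha * beta * n / (2.0 * math.log(p))

alpha = 0.8
thresholds = {
    (n, p): [critical_sparsity(n, p, alpha, beta)
             for beta in (0.7, 0.8, 0.9, 1.0)]
    for (n, p) in [(8000, 32000), (3000, 36000)]
}
```

For $\beta = 1$ this gives roughly 308 for $(n,p) = (8000,32000)$ and roughly 114 for $(n,p) = (3000,36000)$, and the threshold scales linearly in $\beta$.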

Challenging the regularization parameter value (8). We evaluate, for $(\alpha, \beta) = (0.8, 0.8)$,
the probability of sparsistency using a value of $\gamma$ different from

    $\gamma_0 = \frac{\varepsilon}{\sqrt{1 - \alpha}} \sqrt{\frac{2 \log p}{n}}$   (40)

given in (8), for which Theorem 1 is valid. We use the critical sparsity level $k = k_\beta$ defined
in (39). To study only the influence of $\gamma$, we use a SNR that is infinite, meaning that $\varepsilon$
is negligible in comparison with $T$. This implies in particular that in this regime, only
condition (C1) has to be checked to estimate the probability of sparsistency.

Figure 2 shows the increase in this probability as the ratio $\gamma / \gamma_0$ increases. This makes
sense because the signal is large with respect to the noise so that a large threshold should
be preferred. One can see that at the critical value $\gamma = \gamma_0$ suggested by Theorem 1, this
probability is close to 1. This again confirms that the value (8) of $\gamma$ is quite sharp.




[Figure 2: two panels; horizontal axis $\gamma / \gamma_0$.]

Figure 2: Probability of support recovery for large $T$ as a function of $\gamma / \gamma_0$ for $k = k_\beta$ and $(\alpha, \beta) = (0.8, 0.8)$.

Challenging the signal-to-noise ratio (7). Lastly, we estimate, for $(\alpha, \beta) = (0.8, 0.8)$, the
minimal signal level $T$ that is required to ensure the inclusion of the support, meaning that
$I(x(\gamma)) \subset I(x_0)$. We use the critical sparsity $k = k_\beta$ and $\gamma = \gamma_0$, with $k_\beta$ and $\gamma_0$ as defined
respectively in (39) and (40). Since we are only interested in support inclusion, it is only
needed to check condition (C2).

The bound in (7) suggests that $T > 5.5 \gamma_0$ is enough. Figure 3 however shows that this
bound is pessimistic, and that $T > 2 \gamma_0$ appears to be enough to guarantee the support
inclusion with high probability. A few reasons may explain this sub-optimality.

• There is no guarantee that the concentration lemmas we use are optimal.

• The limit ratio $T / \gamma$ relies mainly on Lemma 4 and especially on the bound $1 + 4\sqrt{b}$ in
it. This bound can be improved in at least three ways.

    - Using the same proof, the bound can be slightly enhanced by lowering the
      probability of success.

    - The result in the lemma is non-asymptotic. The bound and the probability were
      computed to be valid for all $0 < b < 1$ and for all $p > 1212$. With the
      values used in the numerical experiments, and lowering a bit the probability of
      success, the bound can turn into $1 + 2.7\sqrt{b}$, yielding a better bound $T > 4.37 \gamma_0$.

    - In the proof of Lemma 4, the inequality $\|B_i\|_2 \le \rho(B)$ is used, where $\rho(B)$ is
      the spectral radius of $B$. This bound is valid for any symmetric matrix, but one might
      perhaps do better by exploiting Gaussianity of the measurement matrix.




[Figure 3: two panels; horizontal axis $T / \gamma_0$.]

Figure 3: Probability of support inclusion as a function of $T / \gamma_0$ for $k = k_\beta$ and $(\alpha, \beta) = (0.8, 0.8)$.

Conclusion

This paper has presented a novel analysis of the sparsistency of the Lasso from noisy 
Gaussian measurements. We derived sharp bounds on the sparsity of the signal to guarantee 
sparsistency with high probability. This result is extended to handle compressible signals 
and to establish sharp $\ell_2$-consistency. A distinctive feature of our analysis is that it provides
explicit constants for the three key parameters of the problem: the sparsity of the signal, 
the minimal signal-to-noise ratio and the Lasso regularization parameter. Numerical results 
support the claim that these constants are either sharp or at least reasonably well behaved. 

A. Properties of Wishart Matrices 

A.1. Signs of non-diagonal entries of an inverse Wishart matrix

Lemma 2. If $B \in \mathbb{R}^{k \times k}$ is the inverse of a Wishart matrix, then for all $i \le k$, the variables
$(\operatorname{sign}(B_{i,j}), j \ne i)$ form a Rademacher sequence, that is they are independent and uniformly
distributed on $\{-1, 1\}$. Moreover this sequence is independent of $B_{i,i}$, and of $(|B_{i,j}|)_{j \ne i}$.

Proof: If $B = (B_{i,j})_{i \le k, j \le k} \in \mathbb{R}^{k \times k}$ is the inverse of a Wishart matrix, then
$B = (A^{\mathrm T} A)^{-1}$ where $A \in \mathcal{M}_{n,k}(\mathbb{R})$ is a Gaussian matrix. Let $E \in \mathcal{M}_{k,k}(\mathbb{R})$ be
diagonal such that for all $1 \le i \le k$, $|E_{i,i}| = 1$. Then $(AE)^{\mathrm T} AE = E A^{\mathrm T} A E$, hence
$((AE)^{\mathrm T} (AE))^{-1} = E (A^{\mathrm T} A)^{-1} E$. Therefore the entries of $C = ((AE)^{\mathrm T} AE)^{-1}$ are $C_{i,j} =
E_{i,i} E_{j,j} B_{i,j}$ for $1 \le i, j \le k$.

But $A$ and $AE$ have the same law, hence $B$ and $C$ also have the same law. Hence for all
$(\varepsilon_j)_{j \le k, j \ne i} \in \{-1, 1\}^{k-1}$, the laws of $(B_{i,1}, \ldots, B_{i,k})$ and $(\varepsilon_1 B_{i,1}, \ldots, B_{i,i}, \ldots, \varepsilon_k B_{i,k})$ are the
same. This implies that the variables $(\operatorname{sign}(B_{i,j}), j \ne i)$ form a Rademacher sequence, and
this sequence is independent of $B_{i,i}$, and of $(|B_{i,j}|)_{j \ne i}$.

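A.2. Extreme eigenvalues of a Wishart matrix

The proof of the following lemma can be found in [62, page 42].

Lemma 3. If $A \in \mathbb{R}^{n \times k}$ is a Gaussian matrix whose coefficients are centered of variance
$1/n$, then the maximal and minimal eigenvalues of the Wishart matrix $B = A^{\mathrm T} A$ satisfy for
all $t > 0$

    $P\left( \lambda_{\max}(B) \ge \left( 1 + \sqrt{k/n} + t \right)^2 \right) \le e^{-n t^2 / 2}$

and

    $P\left( \lambda_{\min}(B) \le \left( 1 - \sqrt{k/n} - t \right)^2 \right) \le e^{-n t^2 / 2}.$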

A.3. Sup-norm of a projected Rademacher sequence

Lemma 4. If $C \in \mathbb{R}^{n \times k}$ is a Gaussian matrix, with $k \le \frac{b\,n}{2 \log p}$ with $0 < b < 1$, and if
$S \in \{-1, 1\}^k$ is drawn independently from $C$, then if $p > 1212$,

    $P\left( \|(C^{\mathrm T} C)^{-1} S\|_\infty \le 1 + 4\sqrt{b} \right) \ge 1 - k p^{-1.28} - 2 e^{-\frac{n b (0.75\sqrt{2} - 1)^2}{4 \log p}}.$

Proof: We use the following splitting

    $(C^{\mathrm T} C)^{-1} = I + ((C^{\mathrm T} C)^{-1} - I) = I + B.$

This shows that

    $\|(C^{\mathrm T} C)^{-1} S\|_\infty \le \|S\|_\infty + \|BS\|_\infty = 1 + \|BS\|_\infty.$

One can then observe that $(BS)[i] = \sum_{j \le k} |B_{i,j}| S[j] \operatorname{sign}(B_{i,j})$; one has $B_{i,i} > 0$,
and according to Lemma 2, for given $i$, the variables $(\operatorname{sign}(B_{i,j}))_{j \ne i}$ form a Rademacher
sequence (this means that they are independent and uniformly distributed on $\{-1, 1\}$),
and this sequence is independent of $B_{i,i}$ and of $(|B_{i,j}|)_{j \ne i}$. Hence one can apply Hoeffding's
Lemma 10 (multiplying the line by an independent variable uniform on $\{-1, 1\}$ to take
care of the fact that $\operatorname{sign}(B_{i,i})$ is not uniformly distributed), thus getting for any $i \le k$
and any $t > 0$,

/ h \ ^ 

< . (41) 

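Now, for all $i \le k$, $\|B_i\|_2 \le \rho(B)$, where $\rho(B)$ is the spectral radius of $B$. Using
Lemma 3 with $t = (0.75\sqrt{2} - 1) \sqrt{\frac{b}{2 \log p}}$ and the fact that $\frac{k}{n} \le \frac{b}{2 \log p}$, we get

    $P\left( \lambda_{\min}(C^{\mathrm T} C) \le \left( 1 - 0.75 \sqrt{\frac{b}{\log p}} \right)^2 \right) \le e^{-\frac{(0.75\sqrt{2} - 1)^2 b n}{4 \log p}}.$

Consequently

    $P\left( \lambda_{\max}((C^{\mathrm T} C)^{-1}) \ge \left( 1 - 0.75 \sqrt{\frac{b}{\log p}} \right)^{-2} \right) \le e^{-\frac{(0.75\sqrt{2} - 1)^2 b n}{4 \log p}}.$

Similarly, we have

    $P\left( \lambda_{\min}((C^{\mathrm T} C)^{-1}) \le \left( 1 + 0.75 \sqrt{\frac{b}{\log p}} \right)^{-2} \right) \le e^{-\frac{(0.75\sqrt{2} - 1)^2 b n}{4 \log p}}.$

It finally follows that with probability larger than $1 - 2 e^{-\frac{n b (0.75\sqrt{2} - 1)^2}{4 \log p}}$,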



p{B) < max 



1 + 0.75. 



logp 



(1-0.75. 



logp 



In particular, taking iH^P) > ^.^129)2 ~ '^'^'^ leads to /^(-B) < '^■'^\Ju^ with probability 

716(0.75^2-1)^ 

greater than 1 — 2e '"°sp 

Using this bound in (j4ip with f = 1.6y^log(p) yields 



(ll^^lloo > < P ll^^lloo > * Pill2 and < 2.5. 



b 



logp 



P p{B) > 2.5 



6 



< kp~^-'^^ + 2e 
If we set > 7.08, the following holds, 

/ll, ^ 1 II ^\ 1 9S nh(0. 7572-1)2 

P ( (C7TC)"^S < 1 + AVb) > 1 - kp-^-^^ - 2e 



logp / 

n6(0.75\/2-l)2 



Remark 2. It is worth noting that if $\log(p)/b > 16.2$ as in the numerical experiments ($b =
0.64$, $p = 32000$), one can adapt this proof and, by losing a bit on the probability (i.e.
applying the concentration lemmas with smaller values of $t$), one can get $\|(C^{\mathrm T} C)^{-1} S\|_\infty \le
1 + 2.7\sqrt{b}$ w.o.p.

A.4. Rotation invariance

Lemma 5. If $C \in \mathbb{R}^{n \times k}$ is a Gaussian matrix, and $w \in \mathbb{R}^n$ is independent of $C$, the law
of $C^+ w$ is invariant under orthogonal transforms on $\mathbb{R}^k$.

Proof: If $C \in \mathbb{R}^{n \times k}$ is a Gaussian matrix, then for any orthogonal matrix $U \in \mathbb{R}^{k \times k}$,
$D = CU$ and $C$ have the same distribution. The laws of $D^+ w$ and $C^+ w$ are thus the same.
Since for all $w$, one has

    $D^+ w = U^{-1} C^+ w,$

the law of $U^{-1} C^+ w$ is the same as that of $C^+ w$.

A.5. Distribution of a quadratic form

The following lemma is a consequence of [63, Theorem 3.2.12].

Lemma 6. If $B$ is a Wishart matrix as described in Lemma 3, then for all $X \in \mathbb{R}^k$
independent of $B$, the random variable $\frac{n \|X\|_2^2}{X^{\mathrm T} B^{-1} X}$ follows a $\chi^2$ distribution with $n - k + 1$
degrees of freedom.



B. Concentration inequalities 



The following lemma is well known; a proof can be found in [64].
Lemma 7. Let $\mu_k$ denote the uniform probability on the unit sphere $S^{k-1}$ in $\mathbb{R}^k$, and let

k ^ 

A C 8''-^ such that fikiA) > \. Then Hk{{x £ S''"\d(x,A) < e}) > 1 - 2e"^. As a 
corollary, G < e} > 1 — 4e~~. 



The following lemma is due to Cai and Silverman, see [65].
Lemma 8. If $X$ follows a $\chi^2$ distribution with $k$ degrees of freedom, then for all $\delta > 0$,

    $P(X \ge (1 + \delta) k) \le \frac{1}{\delta \sqrt{2 \pi k}} e^{-\frac{k}{2} (\delta - \log(1 + \delta))}.$



The following lemma is due to Hoeffding, see [66].

Lemma 9. If $X$ follows a $\chi^2$ distribution with $k$ degrees of freedom, then for all $\delta > 0$,

kiogn-S) 
P{X < il-6)k) < e 2 

The following lemma can be obtained by applying the Chernoff-Hoeffding inequality. 

Lemma 10. If $(\varepsilon_i)_{i \le k}$ is a Rademacher sequence, then for all $a = (a_i)_{i \le k} \in \mathbb{R}^k$ and for
all $t > 0$,

k 



P 



i=l 



> t ||a|l2 ^6 ^ 
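
A concentration inequality of this type is easy to probe by simulation. The Python sketch below estimates the frequency of the event $|\sum_i \varepsilon_i a_i| > t \|a\|_2$ and compares it with the standard two-sided Hoeffding bound $2 e^{-t^2/2}$; this standard form is an assumption of the sketch and may differ by a constant factor from the statement of Lemma 10. The weight vector, the seed, the number of trials and the function name are illustrative choices.

```python
import math
import random

def rademacher_tail_frequency(a, t, trials, rng):
    """Empirical frequency of |sum_i eps_i a_i| > t * ||a||_2 over
    independent Rademacher draws eps_i in {-1, +1}."""
    norm_a = math.sqrt(sum(v * v for v in a))
    hits = 0
    for _ in range(trials):
        s = sum(v if rng.random() < 0.5 else -v for v in a)
        if abs(s) > t * norm_a:
            hits += 1
    return hits / trials

rng = random.Random(1)                          # fixed seed (illustrative)
a = [1.0 / (i + 1) for i in range(50)]          # arbitrary weight vector
t = 2.0
frequency = rademacher_tail_frequency(a, t, trials=5000, rng=rng)
hoeffding_bound = 2.0 * math.exp(-t * t / 2.0)  # standard two-sided bound
```

With $t = 2$ the bound equals $2e^{-2} \approx 0.27$; the empirical frequency is expected to sit well below it, which reflects the fact that Hoeffding-type bounds are valid for every weight vector $a$ rather than tight for a particular one.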